We use a two-component hurdle model: the first component predicts whether a disease will be present (binary) and, if it is present, the second component predicts the case count (integer). Here we compare the results of a boosted tree model (xgboost) with those of our baseline model.
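
The way the two components combine into one prediction can be sketched as follows. This is a minimal illustration, not the code used for this report: the function name `hurdle_predict`, its arguments, and the 0.5 classification threshold are all assumptions made for the example.

```python
from typing import List, Sequence


def hurdle_predict(
    p_present: Sequence[float],
    count_if_present: Sequence[float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the report
) -> List[int]:
    """Combine a presence probability and a conditional count into one prediction.

    p_present        -- stage 1 output: predicted probability that the disease is present
    count_if_present -- stage 2 output: predicted case count, given presence
    """
    if len(p_present) != len(count_if_present):
        raise ValueError("both stages must score the same rows")

    combined = []
    for p, count in zip(p_present, count_if_present):
        if p >= threshold:
            # Disease predicted present: report the count as a non-negative integer.
            combined.append(max(0, round(count)))
        else:
            # Disease predicted absent: the count is zero by construction.
            combined.append(0)
    return combined


print(hurdle_predict([0.9, 0.2, 0.5], [12.4, 300.0, 3.6]))
```

The second row shows the point of the hurdle: a large stage 2 count is ignored when stage 1 predicts the disease to be absent.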

Disease Status

disease status confusion matrix

.metric   model     full_model
accuracy  baseline  0.85
accuracy  xgboost   0.96
kap       baseline  0.45
kap       xgboost   0.88
sens      baseline  0.99
sens      xgboost   0.98
spec      baseline  0.36
spec      xgboost   0.90

Metric descriptions:
accuracy: the proportion of the data that are predicted correctly.
kap: similar to accuracy, but normalized by the accuracy that would be expected by chance alone; very useful when one or more classes make up a large share of the data.
sens: the proportion of positive results out of the number of samples that were actually positive.
spec: the proportion of negative results out of the number of samples that were actually negative.

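
For reference, the four metrics can be computed from the cells of a binary confusion matrix as in the sketch below. The function name `binary_metrics` and the counts in the usage line are hypothetical and are not taken from the models in this report; kappa is computed with the usual Cohen's kappa formula.

```python
from typing import Dict


def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Accuracy, Cohen's kappa, sensitivity and specificity from confusion-matrix counts.

    tp/fn -- actual positives predicted positive / negative
    fp/tn -- actual negatives predicted positive / negative
    """
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sens = tp / (tp + fn)  # share of actual positives that were found
    spec = tn / (tn + fp)  # share of actual negatives that were found

    # Agreement expected by chance: product of the actual and predicted
    # class frequencies, summed over both classes.
    p_chance = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kap = (accuracy - p_chance) / (1 - p_chance)

    return {"accuracy": accuracy, "kap": kap, "sens": sens, "spec": spec}


# Hypothetical counts, for illustration only.
print(binary_metrics(tp=90, fn=10, fp=20, tn=80))
```

Because kappa subtracts the agreement expected by chance, a model can score a high accuracy and a much lower kappa when one class dominates the data.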
disease status confusion matrix by taxa

.metric   model     birds  buffaloes  camelidae  cats  cattle  cervidae  dogs  equidae  hares/rabbits  sheep/goats  swine
accuracy  baseline  0.85   0.76       0.770      0.76  0.86    0.730     0.80  0.91     0.85           0.86         0.87
accuracy  xgboost   0.95   0.96       0.960      0.97  0.95    0.970     0.95  0.97     0.96           0.96         0.96
kap       baseline  0.42   0.20       0.130      0.38  0.56    0.059     0.52  0.42     0.20           0.47         0.42
kap       xgboost   0.84   0.91       0.890      0.94  0.88    0.920     0.91  0.87     0.86           0.89         0.88
sens      baseline  0.98   1.00       1.000      1.00  0.99    1.000     0.99  0.99     0.99           0.99         0.99
sens      xgboost   0.97   0.97       0.970      0.98  0.97    0.980     0.96  0.99     0.98           0.98         0.98
spec      baseline  0.34   0.15       0.094      0.32  0.49    0.043     0.48  0.31     0.14           0.38         0.32
spec      xgboost   0.85   0.94       0.920      0.96  0.91    0.940     0.94  0.87     0.87           0.90         0.90

disease status confusion matrix by continent

.metric   model     Africa  Americas  Asia  Europe  NA    Oceania
accuracy  baseline  0.84    0.82      0.85  0.87    0.94  0.930
accuracy  xgboost   0.95    0.96      0.96  0.95    NA    0.990
kap       baseline  0.48    0.38      0.47  0.46    0.44  0.120
kap       xgboost   0.88    0.91      0.89  0.84    NA    0.920
sens      baseline  0.99    0.99      0.99  0.99    0.99  1.000
sens      xgboost   0.97    0.98      0.98  0.98    NA    1.000
spec      baseline  0.40    0.30      0.38  0.37    0.33  0.068
spec      xgboost   0.91    0.93      0.91  0.85    NA    0.920

disease status direction change confusion matrix

.metric   model     full_model
accuracy  baseline  0.850
accuracy  xgboost   0.960
kap       baseline  0.052
kap       xgboost   0.540
sens      baseline  0.470
sens      xgboost   0.590
spec      baseline  0.680
spec      xgboost   0.810

Metrics are defined as in the disease status confusion matrix above.

Note there are baseline cases where disease status is positive but cases are NA, which are imputed in the model as 0.

disease status direction change confusion matrix by taxa

.metric   model     birds  buffaloes  camelidae  cats   cattle  cervidae  dogs   equidae  hares/rabbits  sheep/goats  swine
accuracy  baseline  0.850  0.760      0.77       0.760  0.860   0.7300    0.800  0.910    0.850          0.860        0.870
accuracy  xgboost   0.950  0.960      0.96       0.970  0.950   0.9700    0.950  0.970    0.960          0.960        0.960
kap       baseline  0.064  0.032      0.04       0.025  0.042   0.0032    0.039  0.082    0.043          0.052        0.061
kap       xgboost   0.390  0.670      0.66       0.770  0.510   0.7500    0.660  0.510    0.540          0.550        0.540
sens      baseline  0.440  0.580      0.55       0.570  0.430   0.5700    0.510  0.480    0.460          0.470        0.480
sens      xgboost   0.530  0.620      0.63       0.700  0.570   0.6900    0.630  0.570    0.610          0.580        0.580
spec      baseline  0.690  0.660      0.67       0.660  0.670   0.6400    0.670  0.700    0.680          0.680        0.690
spec      xgboost   0.760  0.860      0.86       0.900  0.800   0.9000    0.860  0.800    0.810          0.820        0.810

disease status direction change confusion matrix by continent

.metric   model     Africa  Americas  Asia   Europe  NA     Oceania
accuracy  baseline  0.840   0.820     0.850  0.87    0.940  0.930
accuracy  xgboost   0.950   0.960     0.960  0.95    NA     0.990
kap       baseline  0.036   0.025     0.059  0.09    0.065  0.039
kap       xgboost   0.540   0.590     0.550  0.49    NA     0.530
sens      baseline  0.450   0.470     0.470  0.46    0.430  0.610
sens      xgboost   0.560   0.580     0.600  0.58    NA     0.540
spec      baseline  0.670   0.670     0.680  0.69    0.680  0.700
spec      xgboost   0.810   0.830     0.820  0.79    NA     0.810

disease status variable importance and partial dependency (xgboost only)

##                            Feature        Gain       Cover   Frequency
##  1:            disease_status_lag1 0.776430287 0.054115316 0.033185841
##  2:             cases_lag1_missing 0.049485755 0.043642648 0.025073746
##  3:       ever_in_country_any_taxa 0.048458073 0.050788831 0.019911504
##  4:            disease_status_lag2 0.017900276 0.036354726 0.020648968
##  5:           log_human_population 0.013270873 0.044711771 0.135693215
##  6:             cases_lag2_missing 0.009267981 0.006659092 0.016961652
##  7:        disease_population_wild 0.009183998 0.014212483 0.007374631
##  8:             log_gdp_per_capita 0.009140356 0.023998585 0.109882006
##  9:            disease_status_lag3 0.006213909 0.034683627 0.016961652
## 10:             cases_lag3_missing 0.005740374 0.017600986 0.014749263
## 11:            log_taxa_population 0.005057817 0.020693720 0.061209440
## 12: cases_lag_sum_border_countries 0.004211446 0.047838273 0.044985251
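
The Gain column can be read as each feature's share of the model's total split gain, so shares can be added across related features. The sketch below copies the twelve Gain values from the table above and sums them by group; the variable names are illustrative, and only the features shown in the table are included.

```python
# Gain values copied from the importance table above (top 12 features only).
gain = {
    "disease_status_lag1": 0.776430287,
    "cases_lag1_missing": 0.049485755,
    "ever_in_country_any_taxa": 0.048458073,
    "disease_status_lag2": 0.017900276,
    "log_human_population": 0.013270873,
    "cases_lag2_missing": 0.009267981,
    "disease_population_wild": 0.009183998,
    "log_gdp_per_capita": 0.009140356,
    "disease_status_lag3": 0.006213909,
    "cases_lag3_missing": 0.005740374,
    "log_taxa_population": 0.005057817,
    "cases_lag_sum_border_countries": 0.004211446,
}

# Share of total gain carried by the three lagged disease status features.
lagged_status_gain = sum(
    value for name, value in gain.items() if name.startswith("disease_status_lag")
)

# Share of total gain carried by the twelve listed features together.
top12_gain = sum(gain.values())

print(f"lagged disease status: {lagged_status_gain:.3f}")
print(f"top 12 features:       {top12_gain:.3f}")
```

On these values the three lagged disease status features account for about 0.80 of the total gain, and the twelve listed features for about 0.95.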

Cases

Here we evaluate the subset of the training data with positive case counts.

cases model stats

## # A tibble: 6 x 4
##   model    .metric .estimator  .estimate
##   <chr>    <chr>   <chr>           <dbl>
## 1 baseline rmse    standard   210751.   
## 2 xgboost  rmse    standard   260813.   
## 3 baseline rsq     standard        0.371
## 4 xgboost  rsq     standard        0.144
## 5 baseline mae     standard     1787.   
## 6 xgboost  mae     standard     2295.
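
The three metrics in the table above can be sketched as follows. The column layout (`.metric`, `.estimator`, `.estimate`) suggests the yardstick package, so `rsq` is written here as the squared Pearson correlation between observed and predicted values, which is yardstick's definition; that the report used yardstick is an assumption. The function names and the example values are illustrative only.

```python
import math
from typing import Sequence


def rmse(truth: Sequence[float], estimate: Sequence[float]) -> float:
    """Root mean squared error: large errors are penalized quadratically."""
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth))


def mae(truth: Sequence[float], estimate: Sequence[float]) -> float:
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(t - e) for t, e in zip(truth, estimate)) / len(truth)


def rsq(truth: Sequence[float], estimate: Sequence[float]) -> float:
    """Squared Pearson correlation between truth and estimate."""
    n = len(truth)
    mean_t = sum(truth) / n
    mean_e = sum(estimate) / n
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(truth, estimate))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_e = sum((e - mean_e) ** 2 for e in estimate)
    return cov ** 2 / (var_t * var_e)


# Made-up counts with one very large outbreak that is badly under-predicted.
truth = [10, 20, 30, 1000]
estimate = [12, 18, 33, 400]
print(rmse(truth, estimate), mae(truth, estimate), rsq(truth, estimate))
```

In the made-up example a single large miss pulls RMSE well above MAE. The same pattern in the table (RMSE far larger than MAE for both models) is consistent with a small number of very large errors dominating the squared-error metric.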
cases residuals
cases residuals by taxa
cases residuals by continent

cases variable importance and partial dependency (xgboost only)

##                            Feature        Gain       Cover   Frequency
##  1:            log_taxa_population 0.393012236 0.180119640 0.062023939
##  2:                     cases_lag1 0.200759320 0.203609619 0.205658324
##  3:              country_iso3c_IRQ 0.130015681 0.063435733 0.005440696
##  4:             log_gdp_per_capita 0.039913259 0.049233161 0.054406964
##  5:   disease_mycoplasma_infection 0.036270814 0.025614941 0.003264418
##  6:           log_human_population 0.030280061 0.039263981 0.079434168
##  7:                     cases_lag2 0.027312253 0.045814607 0.102285092
##  8:     log_veterinarians_per_taxa 0.026401802 0.052468420 0.059847661
##  9:              country_iso3c_RWA 0.016300578 0.027103687 0.009793254
## 10: cases_lag_sum_border_countries 0.008151194 0.028439215 0.071817193
## 11:              country_iso3c_VNM 0.006341258 0.015041983 0.004352557
## 12:                     cases_lag3 0.006169676 0.008213044 0.068552775

cases partial dependency by select disease (xgboost only)